Revision History | ||
---|---|---|
Revision 4 | 2019-06-12 | LM |
Changed disk states WORKING/FAILED with UP/DOWN. | ||
Revision 3 | 2019-05-27 | LM |
Added milliseconds in alert/connect time. Added traps and mails section. Added disks verifications and recoveries. | ||
Revision 2 | 2019-05-07 | LM |
Changed values for state_reached, in_type, out_type. Defined fraction of seconds in timestamps to be of 3 digits always present, see Date and Time representation. | ||
Revision 1 | 2019-04-24 | LM |
Now ALL fields of CDR are defined and basic Abilis configuration requirement shown. | ||
Revision 0 | 2019-04-03 | LM |
Document created. |
Table of Contents
Procedures to store and retrieve Call Data Records (CDRs) for Abilis as SIP exchange for up to 4000 calls.
Abbreviations
For every terminated call Abilis must store a record, so called CDR, with the information which are relevant for the billing.
There are core aspects that are relevant:
local storage on dedicated high performance disks (SSD disks)
redundant storage (two disks)
record validity check (hash)
textual representation (csv)
apply all reasonable action to protect data correctness and integrity, considering the Abilis platform characteristics and limitation (FAT file system)
simple methods for data retrieval from the external client application (ftps or https)
records added when call is terminated
regular shutdowns (e.g. warm start) must force call termination and safe CDR saving
CDRs for ongoing calls dropped by unexpected system reboot (e.g. power failure) is left to a second step, but it's behaviour must be defined and described.
We choose the W3C profile of ISO 8601 representation (see Bibliography for external links with full details).
A concise summary is, below.
Warning | |
---|---|
The W3C profile of ISO 8601 says that the application should define the maximal number of digits for fractional part of second. Abilis goes further and specifies a fixed number of 3 decimals, always present. |
Complete date plus hours, minutes, seconds and an decimal fraction of a second YYYY-MM-DDThh:mm:ss.sssTZD (eg 1997-07-16T19:20:30.450+01:00) where: YYYY = four-digit year MM = two-digit month (01=January, etc.) DD = two-digit day of month (01 through 31) hh = two digits of hour (00 through 23) (am/pm NOT allowed) mm = two digits of minute (00 through 59) ss = two digits of second (00 through 59) sss = three digits representing milliseconds TZD = time zone designator (Z or +hh:mm or -hh:mm)
The most critical element for storage in Abilis is that the only supported file system is FAT, and it has a series of well known weaknesses when files are created, extended, deleted. On the contrary file renaming and moving does not risk of corrupting FAT even in a sudden failure, and there are ways to handle it after a reboot.
To highly reduce risks of FAT corruption and consequent data loss we already successfully adopted the trick of creating files of preallocated size and never add, extend, delete files, or at least to minimize it. We will follow this idea for CDR recording.
Another aspect to follow is the concept of "write and forget" on a per-file basis, i.e. once CDRs are written they won't be touched anymore.
Since Abilis can't guarantee medium to long term storage, the idea is to make CDR files available at short intervals (e.g. 15 minutes), so that the billing system can frequently retrieve the CDRs.
To further add a protection level we will store the CDRs on two independent disks, so that failure of one disk will not stop the system, and the presence of a record hash will permit detection of data corruption and recovery from the other non-corrupted disk. We will have a fault only if both disks are corrupted.
We will add alarms, traps, mails, for the relevant conditions.
We can describe the core aspect of the CDRs recording and handling as follow:
the system works with a set of preallocated files of fixed size, filled with ctrl-z character.
the number of files needed is determined automatically from MAX-AGE, FILE-SIZE, INTERVAL parameters.
the number of files will be increased ONLY if the actual need demands for more files, but it will be done in shots to minimise "time window of risk", and "in advance" with respect to the actual need.
the file(s) of the current time interval will not be available for retrieval.
when the current time interval terminates or the file is full, the current file will be made available for retrieval and from this moment on Abilis will not touch such files, up to the need of "cleanup".
when a file has to be cleaned (MAX-AGE expiry or .INV extension) it is wiped with ctrl-z character.
filenames and file extensions will play a key role in procedures
client will use FTP(S) to retrieve list of files and then fetch the desired one
client can't delete files, but it can rename a file to request Abilis to wipe it with ctrl-z charcter, basically to comply with privacy needs
the same files will be made available on two disks and client will access them independently
Files are created in the working directory and ftp/http must be properly configured to publish such files with the read/rename permission only.
If it is desired to leave to Abilis the full control on aging it will be enough to set only read permission.
Files are created in the working directory (e.g. E:\CDR\ and F:\CDR\)
Files are created with "hidden" attributes and wuill be removed when renamed to .CSV extension
Extensions and their meaning
File free for later use. Hidden.
The filename is not relevant. It will likely have some kind of progressive number or have some "old" filename.
File of the current period, opened for exclusive write. Hidden.
The filename contains date, daily sequence number, local time time of the beginning of the period it refers to.
File for the next period, opened for exclusive write. Hidden.
The filename contains date, daily sequence number, local time of the beginning of the period it refers to.
File containing valid data. Not hidden.
The filename preserve date, daily sequence number, local time of the beginning of the period it refers to.
File invalidated by client. Not hidden.
The client can change an extension from CSV to INV, periodically Abilis will check for INV presence to wipe them with ctrl-z and return to .FRE
Shall we include a protection to not wipe files before e certain age? E.g. guarantee that data is not wiped for e.g. 1 week ?
Filenames and their meaning
4 digit year (e.g. 2019)
2 digit month (e.g. 04)
2 digit day (e.g. 09)
4 digits daily sequence number (e.g. 0007)
2 digit hour (e.g. 18)
2 digit minutes (e.g. 15)
Date and Time is UTC to skip issues with STD<>DST change
The presence of UTC time in the filename has the main purpose to avoid file name repetition if the sequence number has to be reset for some reason, e.g. if disks are replaced or reformatted.
It can also be used to quickly identify the period to which the file refers, mainly for some kind of manual inspection.
The sequence number MUST be guaranteed to be progressive within the day at the best of possibility, therefore here is precise list of actions to be made:
when CDR service starts (at boot or after CDR-ACT:NO->YES)
scan both disk and apply the verification and eventually the corrective actions described in the specific chapter.
if a .CUR file is present
if it's YYYYMMDD is equal to current UTC date, it's sequence number must be set as the one "in use", and start regular processing (if interval is over, the file will be immediately closed to .CSV and a new CUR file started).
if it's YYYYMMDD is NOT equal to current UTC date it must be closed and renamed to CSV, and proceed as if .CUR file was not present
if a .CUR file is NOT present
scan all .CSV and .INV files having in filename the YYYYMMMDD of the current UTC date to detect the latest sequence number
start a new .CUR with the next sequence number
during normal CDR service operation the current sequence number is stored and increased in memory, without the need to scan all files present on disks when a new interval starts.
Of course the CDR service is in charge of it's integrity and it's verification or regeneration when the conditions suggest that it is no more reliable (e.g. service restart, internal failures, ...).
During normal work there are various disk activities, like listing, read, write, and so on, and every file I/O operation can terminate with an error code.
The action to be taken depends on the error code and on the procedure affected.
We can basically identify the following error typologies:
These errors occur when the conditions are not as expected.
A typical example is creation of a file that already exists, or open a file for reading but it does not exist.
In these situations it's the procedure itself that must take the corrective action.
These errors occur when the operating system or the file system meets some particular software conditions. As example we could imagine the "too many files opened".
These errors could be limited to individual files, so some part of the procedure can continue to work while the other stops.
In these situation it's the procedure itself that must take the corrective actions, as well as a carefully designed wait-and-retry attempt, up to the handling of the persistent condition.
These errors occur when the disk is having some defect.
Since disk defects can be of various type, also the reaction should be of various type.
We can basically identify the following errors:
A typical example are "bad sector" or "data errors". These errors are "limited" in the sense that only a portion of the disk can be affected, leaving other parts functional.
Dealing with these errors in fully automated procedure is procedure dependent, it may range from procedure blockage or proceed discarding unreadable data. Mind that these errors often occurs after several retries, and thus with a not negligible delay. This delay must be taken into consideration when there are serialized actions.
As example, in ACNT-CDR environment writing to a .CUR file that shows this probles could be solved by forcedly close the .CUR file and try to go on with a new .FRE file, and send and alert.
These errors occurs when the error may persist for long time until some event occurs or some action is taken.
Some of these events can be considered "normal" temporary failures (e.g. reset of a disk drive, removal/insert of the disk, etc...), some are more "abnormal" and it is not know if and when they could be solved (e.g. device timeout, drive not found, etc ...)
Errors that carry large delays, like "device timeout", must be carefully considered because they add a large delay and may destroy all the subsequent sequential activities.
In ACNT-CDR , for example, since the write to both disk is done sequentially in the same thread, the large timeout wasted on one disk can destroy the behaviour of the second disk, sending to hell the high reliability of the two disks approach.
These errors occur when the error will never be recovered until some administrative action is taken.
A typical example is "fat corrupted".
These error necessarily require some special and specific intervention. The functionality can be resumed only after the recovery activities have been taken.
As said, procedural errors and system errors has to be handled in the most reasonable way by the procedure itself.
Focusing on disk errors we can identify important healthy actions for ACNT-CDR.
The retries should be are already performed by the filesystem itself, so in general trying again would not succeed.
Le't sot forget that magnetic HD are more suitable to this kind of error then SSD.
Reaction depends on what is the action that is affected, fro example I can identify:
Try close CUR and restart with a new FRE. Send an alarm.
Also a simpler "Set disk to DOWN (failed) and stop activities on this disk." is probably acceptable.
Set disk to DOWN (failed) and stop activities on this disk.
Set disk to DOWN (failed) and stop activities on this disk.
Set disk to DOWN (failed) and stop activities on this disk.
When a disk in in DOWN state:
It should be skipped in any disk activity because it can damage the activities on the other disk
A separate procedure in a separate thread should be started to periodically "probe" if the disk is returned functional. The period can be tuned, I expect that smtg like "every 30 sec" could be a good starting point.
When DOWN (failed) is recovered, and the recovery is somehow proven to be reliable, the state is changed to IN-USE and activities restarted.
The following conditions are identified for alarm notifications, traps and mails.
Change on CDR-STATE (INACTIVE, UP, DOWN).
Change on Dx-STATE (UNUSED, UP, DOWN).
FIFO-CUR > 50% FIFO-SIZE (reset when FIFO-CUR=0)
FIFO-LOST (reset when FIFO-CUR=0)
List for discussion.
md5 (should we use a different hash algorithm?)
always present
It is computed on field values and for empty fields one character SPACE (0x20) is used.
routing, alerting, connected
always present
YYYY-MM-DDThh:mm:ss.sssTZD
always present
YYYY-MM-DDThh:mm:ss.sssTZD
present if alert state reached, otherwise empty
YYYY-MM-DDThh:mm:ss.sssTZD
present if connect reached, otherwise empty
YYYY-MM-DDThh:mm:ss.sssTZD
always present
ss (seconds)
or ss.sss (sec.msec)
or 0
ss (seconds)
or ss.sss (sec.msec)
or 0
inout, output, local, transfer (probably rules needs deeper investigation for the public exchange environment)
ctip, clus(ter), sip, iax, disa, vo, vm, mix, sl
x, clusname, username
ctip, clus(ter), sip, iax, disa, vo, vm, mix, sl
empty if no routing found
x, clusname, username
empty if no routing found
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
u(nknown), n(ational), i(nternational), o(perator), s(ubscriber), c(coded), h(alphanumeric)
IA5 digits
ITU hex value
ITU hex value
input, output, cp, router, others
Next fields have to be included too:
Subaddress fields are interesting since in the Abilis SIP environment they are used for username propagation.
The callid related fields instead are more important because when CTIR is asked to merge two calls in a new one (called either CALL TRANSFER or CALL MERGE), it currently closes CDRs of parent calls and start a new CDR for the child call, that in turn could be further transferred/merged.
IA5 digits
IA5 digits
IA5 digits
IA5 digits
YYYYMMDDhhmmss<ctir callid> (e.g. 201904091803040035)
always present
YYYYMMDDhhmmss<ctir callid> (e.g. 201904091803040035)
present when this call is one of the two parents of a child call
YYYYMMDDhhmmss<ctir callid> (e.g. 201904091803040035)
present when this call is child of two parent calls
YYYYMMDDhhmmss<ctir callid> (e.g. 201904091803040035)
present when this call is child of two parent calls
As previously said, transferred/merged calls collapse two calls into a single new call, so currently the CDRs of the parent calls are closed and a new CDR of the child call is started:
parent call 1, CDR_1 opened
parent call 2, CDR_2 opened
call merge/transfer
child call, CDR_3 opened
parent call 1, CDR_1 closed
parent call 12 CDR_2 closed
As alternative NEW implementation we can imagine:
CDR of parent calls are NOT closed
child call does NOT generate a CDR
a chain of new call + transfer/merge can take place, thus adding other parent calls
when the call finally terminates all the parent CDRs are closed
Complexity and development time is not known at this moment.
For a proper working of CDR there are some requirements for Abilis configuration.
Set appropriate number of clients (e.g. 10) and enable SSL for all sessions (e.g.10).
If needed, increase also MAX-USER-SES (default 2)
[19:18:42] ABILIS_CPX>d p ftp RES:Ftp ----------------------------------------------------------------------- Run DESCR:File_Transfer_Protocol_Server LOG:NO ACT:YES max-cli:10 max-ssl-cli:10 tcp-locport-c:21 tcp-locport-d:20 TOS:0-N IPSRC:127.000.000.001 IPSRCLIST:PrivateIpAdd DATA-TOUT:30 DT:300 REJ-1024:YES SAME-IP:YES SYSDRIVES:NO MAX-PWD-FAIL:4 DELAY-PWD-FAIL:5 MAX-IP-SES:NOMAX MAX-USER-SES:5 ANONYMOUS-USER:DENY REGISTERED-USER:PERMIT ANONYMOUS-HOMEDIR: [19:21:15] ABILIS_CPX>
Create a user specific for the FTPs, e.g. CDR, and enable it for FTP.
Enable user only for FTP-PROT:SSL.
[19:15:34] ABILIS_CPX>d user ------------------------+-------------+---------------------------------------- USER PWD ACT|CTIP CLUS |CHAT LDAP PPP FTP HTTP MAIL IAX SIP VO ------------------------+-------------+---------------------------------------- ... CDR *** YES # # NO NO NO YES NO NO NO NO NO ...
[19:15:54] ABILIS_CPX>d user:cdr Parameter: | Value: --------------------+---------------------------------------------------------- USER: CDR REAL-NAME: CDR ... FTP: YES FTP-HOMEDIR: FTP-PROT: SSL ...
Add the virtual paths for the two disks
[19:12:17] ABILIS_CPX>d ftp path Parameter: | Value: ------------+------------------------------------------------------------------ PATH: /CDR1/ PHYS-PATH: E:\CDR\ ------------------------------------------------------------------------------- PATH: /CDR2/ PHYS-PATH: F:\CDR\ ------------------------------------------------------------------------------- ...
Add FTP rights just for read and rename (r n) and directory listing (l), and only for PROT:SSL.
If it is desired to leave aging only to Abilis procedures set only read right.
[19:18:40] ABILIS_CPX>d ftp rights ------------------------------------------------------------------------------- ID: PATH: USER: FILE: DIR: RECUR: PROT: ------------------------------------------------------------------------------- 6 /cdr1/ CDR r--n l--- YES SSL ------------------------------------------------------------------------------- 7 /cdr2/ CDR r--n l--- YES SSL ------------------------------------------------------------------------------- ...
This setting is needed for a proper identification of call direction: input (net-pub to user), output (user to net-pub), local (user to user), transit (net-pub/private to/from net-pub/private).
Set CTIP-TYPE:USER.
[01:21:34] ABILIS_CPX>d p ctisip
...
SUB-LIFETIME:180 max-sub:100 CTIP-TYPE:USER
...
SIP users for trunks to public networks: SIP-CTIP-TYPE:NET-PUB
SIP users for connections to MEDIA GATEWAYS: SIP-CTIP-TYPE:SYS (or SIP-CTIP-TYPE:USER)
[01:18:49] ABILIS_CPX>d user:a | sip-ctip-type "user: " USER: BT_IT_MI SIP-CTIP-TYPE: NET-PUBLIC USER: BT_IT_RM SIP-CTIP-TYPE: NET-PUBLIC USER: mg1 SIP-CTIP-TYPE: SYS [01:18:51] ABILIS_CPX>
[1] ISO 8601 W3C. W3C notes on ISO 8601. Fraction of seconds is shown.
[2] ISO 8601. Most commonly used part of the ISO 8601.